Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

[...] simulated genes and with varying numbers of DEGs from 2,000 to [...] It can be seen that both sets of parameters converge quickly.

The algorithm is iterated until the model parameters converge or the maximum learning cycle is reached. Afterwards, the null density and the alternative density are estimated for each gene. The Bayes rule is then used to decide whether a gene is a DEG.
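The Bayes-rule decision described above can be illustrated with a minimal sketch. It is not the DSG implementation: the two Gaussian densities, the prior of 0.1 and the 0.5 decision threshold are hypothetical values chosen only for illustration, and the function name `posterior_deg` is an assumption.

```python
from statistics import NormalDist

def posterior_deg(x, null, alt, prior_deg):
    """Posterior probability that a gene with statistic x is a DEG (Bayes rule)."""
    f0 = null.pdf(x)   # null density evaluated at x
    f1 = alt.pdf(x)    # alternative density evaluated at x
    evidence = (1.0 - prior_deg) * f0 + prior_deg * f1
    return prior_deg * f1 / evidence

# Hypothetical densities: a narrow null and a much wider alternative.
null = NormalDist(mu=0.0, sigma=1.0)
alt = NormalDist(mu=0.0, sigma=5.0)
prior = 0.1  # assumed prior proportion of DEGs

for x in (0.0, 2.0, 6.0):
    p = posterior_deg(x, null, alt, prior)
    label = "DEG" if p > 0.5 else "non-DEG"  # assumed decision threshold
    print(x, round(p, 3), label)
```

A gene is called a DEG here when its posterior probability exceeds 0.5; a stricter threshold would trade sensitivity for fewer false discoveries.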

DSG for simulated data DEG discovery

A simulated data set of 900 non-DEGs and 100 DEGs was designed [Al... 2015]. Only one replicate was used. The design was composed of [...] segments. All non-DEGs were designed to follow a Gaussian distribution centred at zero with a unit standard deviation. In total, 50 down-regulated DEGs were designed to follow a Gaussian distribution centred at negative five with a standard deviation of two. In addition, 50 up-regulated DEGs were designed to follow a Gaussian distribution centred at [...] with a standard deviation of two. Figure 6.49(a) shows the estimated densities for this simulated data set. It can be seen that the alternative density has a very wide distribution compared with the null density. Figure 6.49(b) shows the ROC curve for this DSG model. The AUC value was 0.996.
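A simulated design of this kind can be sketched as follows. This is a hedged illustration and not the author's code: the centre of the up-regulated DEGs is assumed to be positive five by symmetry (the value is not legible in this excerpt), the absolute value of each observation is used as a simple stand-in score in place of a fitted DSG model, and the names `simulate` and `auc` are hypothetical. The AUC it prints is therefore not expected to reproduce the reported 0.996.

```python
import random

def simulate(sd_alt=2.0, up_centre=5.0, seed=1):
    """Draw 900 non-DEGs and 100 DEGs (50 down-regulated, 50 up-regulated)."""
    rng = random.Random(seed)
    non_degs = [rng.gauss(0.0, 1.0) for _ in range(900)]
    down = [rng.gauss(-5.0, sd_alt) for _ in range(50)]
    up = [rng.gauss(up_centre, sd_alt) for _ in range(50)]  # centre is an assumption
    return non_degs, down + up

def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

non_degs, degs = simulate()
# Stand-in score: distance from zero. A DSG model would supply its own score.
value = auc([abs(x) for x in degs], [abs(x) for x in non_degs])
print(round(value, 3))
```

The pairwise AUC is quadratic in the number of genes, which is acceptable for 100 by 900 comparisons; a rank-based formula would be preferable for larger sets.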

The next thing to be investigated was whether the alternative standard deviation caused a difference in DEG discovery using DSG. Rather than being fixed at two, the standard deviation for both down-regulated DEGs and up-regulated DEGs was varied from one to five. Therefore, the overlap between the null density of non-DEGs and the alternative density of DEGs was varied in this trial. For each standard deviation, 50 DSG models were constructed and the jackknife test was used to evaluate these 50 models. [...] AUC was calculated for the evaluation. Thus, how model performance (AUC) varied with the overlap between the non-DEG distribution (the null density) and the DEG distribution (the alternative density) was examined. Figure 6.50 shows the result. It can be seen that the model performance (AUC) of DSG deteriorated slightly when the overlap between the null density and the alternative density increased. This is not a surprise at all.
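Why a larger alternative standard deviation increases the overlap can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the DSG method: the alternative is modelled as an equal mixture of two Gaussians centred at negative five and (assumed) positive five, the overlap is measured as the area under the pointwise minimum of the two densities, and the function name `overlap` and the integration range are hypothetical choices.

```python
from statistics import NormalDist

def overlap(sd, up_centre=5.0, lo=-40.0, hi=40.0, step=0.01):
    """Area under min(null density, alternative density), by the midpoint rule."""
    null = NormalDist(mu=0.0, sigma=1.0)
    down = NormalDist(mu=-5.0, sigma=sd)
    up = NormalDist(mu=up_centre, sigma=sd)  # centre is an assumption
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        alt = 0.5 * down.pdf(x) + 0.5 * up.pdf(x)  # equal-weight mixture
        total += min(null.pdf(x), alt) * step
    return total

# Alternative standard deviations from one to five, as in the trial above.
overlaps = {sd: overlap(float(sd)) for sd in range(1, 6)}
for sd, ov in overlaps.items():
    print(sd, round(ov, 3))
```

As the alternative components widen, more of their mass falls in the region around zero that the null density occupies, so the two classes become harder to separate and a lower AUC is to be expected.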